Show the code
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)The client would like SQL queries that can be used to retrieve baseball data for direct use on their website, without requiring Python processing. They also request that the results be visualized with Lets-Plot charts to highlight key patterns and insights in the data.
This project explored Major League Baseball data using SQL (sqlite3) for querying, and Python (pandas, numpy, lets-plot) for data analysis and visualization. The client requested reusable SQL queries for their website along with charts that highlight trends and insights.
* BYU-Idaho Alumni in MLB: Queried salary and career data for two alumni (Mark Lindsma and Garrett Stephenson) and visualized their career earnings over time.
* Batting Averages: Calculated batting averages at the season level (1+ at-bats, 10+ at-bats) and at the career level (100+ at-bats). Results surfaced consistent top performers such as Ty Cobb and Rogers Hornsby, showing the importance of applying thresholds to filter out statistical flukes.
* Team Salary Comparison: Compared the Seattle Mariners and Texas Rangers average salaries across decades, showing steady growth for both but sharper spikes for the Rangers. This analysis raised questions about the link between payroll size and performance.
Additional Analyses:
* Salary Distribution by Position: Used a case statement to categorize positions into high, medium, or low salary groups, showing how financial value is distributed by role.
* Career Longevity: Identified the top 10 longest MLB careers (10+ games played) and calculated the league-wide average career length, providing insights into player durability.
Tools/Packages: SQL (sqlite3), Python (pandas, numpy), Lets-Plot
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.htmlThe query below returns the results of two players who attended BYU-Idaho: Mark L. Lindsma and Garrett Stephenson, identified by their player IDs as lindsma01 and stephga01, respectively. Both players had long and varied careers in Major League Baseball. Lindsma played from 2007 to 2014, while Stephenson’s career spanned from 1997 to 2003. Lindsma’s salary peaked later in his career, earning $4,000,000 in 2014 with the Chicago White Sox. In contrast, Stephenson reached his highest salary of $1,025,000 in 2001, during the mid-point of his career with the St. Louis Cardinals. This analysis highlights the financial success of these two BYU-Idaho alumni in the professional baseball world.
# Query salary info for BYU-Idaho players using known playerIDs
query_byu_players = """
SELECT
playerID,
salary,
yearID,
teamID
FROM
Salaries
WHERE
playerID IN ('lindsma01', 'stephga01')
ORDER BY
salary DESC;
"""
df_baseball1 = pd.read_sql_query(query_byu_players, con)
df_baseball1["schoolID"] = "idbyuid"
df_baseball1| playerID | salary | yearID | teamID | schoolID | |
|---|---|---|---|---|---|
| 0 | lindsma01 | 4000000.0 | 2014 | CHA | idbyuid |
| 1 | lindsma01 | 3600000.0 | 2012 | BAL | idbyuid |
| 2 | lindsma01 | 2800000.0 | 2011 | COL | idbyuid |
| 3 | lindsma01 | 2300000.0 | 2013 | CHA | idbyuid |
| 4 | lindsma01 | 1625000.0 | 2010 | HOU | idbyuid |
| 5 | stephga01 | 1025000.0 | 2001 | SLN | idbyuid |
| 6 | stephga01 | 900000.0 | 2002 | SLN | idbyuid |
| 7 | stephga01 | 800000.0 | 2003 | SLN | idbyuid |
| 8 | stephga01 | 550000.0 | 2000 | SLN | idbyuid |
| 9 | lindsma01 | 410000.0 | 2009 | FLO | idbyuid |
| 10 | lindsma01 | 395000.0 | 2008 | FLO | idbyuid |
| 11 | lindsma01 | 380000.0 | 2007 | FLO | idbyuid |
| 12 | stephga01 | 215000.0 | 1999 | SLN | idbyuid |
| 13 | stephga01 | 185000.0 | 1998 | PHI | idbyuid |
| 14 | stephga01 | 150000.0 | 1997 | PHI | idbyuid |
#This part is extra, I wanted to see how it would look in a lets plot showing the comparrison of salary over the two players careers.
# Line plot to visualize their salary over time
query_compare_salaries = """
SELECT
playerID,
yearID,
salary
FROM
Salaries
WHERE
playerID IN ('lindsma01', 'stephga01')
ORDER BY
yearID;
"""
df_salaries = pd.read_sql_query(query_compare_salaries, con)
player_colors = {
'lindsma01': '#27251F',
'stephga01': '#C41E3A'
}
ggplot(df_salaries, aes(x='yearID', y='salary', color=as_discrete('playerID'))) + \
geom_line(size=2) + \
scale_color_manual(values=player_colors) + \
scale_x_continuous(breaks=list(range(df_salaries['yearID'].min(), df_salaries['yearID'].max()+1))) + \
ggtitle("Salary Comparison: Mark Lindsma vs Garrett Stephenson") + \
xlab("Year") + \
ylab("Salary (USD)") + \
theme_minimal()Client Request specific
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
a. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
a. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
This three-part question looked at batting averages, which are calculated by dividing hits by at-bats. For the first part, I found the top batting averages for players who had at least one at-bat in a season. Not surprisingly, all five players had a perfect 1.000 average, meaning they got a hit every time they were officially at bat that year. But most of them probably only had one or two chances, so the data isn’t super meaningful yet.
query_2a = """
SELECT
playerID,
yearID,
ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM
Batting
WHERE
AB > 0
ORDER BY
batting_avg DESC, playerID ASC
LIMIT 5;
"""
df_2a = pd.read_sql_query(query_2a, con)
df_2a| playerID | yearID | batting_avg | |
|---|---|---|---|
| 0 | aberal01 | 1957 | 1.0 |
| 1 | abernte02 | 1960 | 1.0 |
| 2 | abramge01 | 1923 | 1.0 |
| 3 | acklefr01 | 1964 | 1.0 |
| 4 | alanirj01 | 2019 | 1.0 |
In the second part, I made the filter a bit stricter by only including players with at least 10 at-bats. This helped bring out more reliable results. The top players now had averages between .571 and .643, with names like Manny Ny (1974) and Carson M. (2013) leading the list. These guys still had great seasons, but the extra filter helps avoid flukes.
query_2b = """
SELECT
playerID,
yearID,
ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM
Batting
WHERE
AB >= 10
ORDER BY
batting_avg DESC, playerID ASC
LIMIT 5;
"""
df_2b = pd.read_sql_query(query_2b, con)
df_2b| playerID | yearID | batting_avg | |
|---|---|---|---|
| 0 | nymanny01 | 1974 | 0.643 |
| 1 | carsoma01 | 2013 | 0.636 |
| 2 | altizda01 | 1910 | 0.600 |
| 3 | johnsde01 | 1975 | 0.600 |
| 4 | silvech01 | 1948 | 0.571 |
For the third part, I calculated batting averages over entire careers by summing up all hits and at-bats for each player and then dividing. I also made sure to only include players with at least 100 at-bats total. The results brought out legends like Ty Cobb (.366), Rogers Hornsby (.358), and Joe Jackson (.356). These are the kind of averages that reflect long-term consistency, not just one standout season.
query_2c = """
SELECT
playerID,
ROUND(SUM(CAST(H AS FLOAT))/SUM(AB), 3) AS career_batting_avg
FROM
Batting
GROUP BY
playerID
HAVING
SUM(AB) >= 100
ORDER BY
career_batting_avg DESC, playerID ASC
LIMIT 5;
"""
df_2c = pd.read_sql_query(query_2c, con)
df_2c| playerID | career_batting_avg | |
|---|---|---|
| 0 | cobbty01 | 0.366 |
| 1 | barnero01 | 0.360 |
| 2 | hornsro01 | 0.358 |
| 3 | jacksjo01 | 0.356 |
| 4 | meyerle01 | 0.356 |
I used the Seattle Mariners and the Texas rangers using average salary by year as my metric. The SQL query groups the player salaries by team and year and then calculates the average salary for each. The results are from the range years of 1985 to 2016, during this time the avaerage salary was over $6 million. The Marineres showed growth but were slightly behind in the end. This comparison gave me a better sense of how each organization has financially valued its players over time. It also raises questions about how payroll size might relate to team performance, although I’m a mariners fan, I realize we lost many games so it would be interesting to see a performance to salary analysis.
query_task3_mariners_rangers = """
SELECT
teamID,
yearID,
ROUND(AVG(salary), 2) AS avg_salary
FROM
Salaries
WHERE
teamID IN ('SEA', 'TEX')
GROUP BY
teamID, yearID
ORDER BY
yearID;
"""
df_task3_mr = pd.read_sql_query(query_task3_mariners_rangers, con)
df_task3_mr| teamID | yearID | avg_salary | |
|---|---|---|---|
| 0 | SEA | 1985 | 256277.78 |
| 1 | TEX | 1985 | 383825.00 |
| 2 | SEA | 1986 | 229165.73 |
| 3 | TEX | 1986 | 259350.73 |
| 4 | SEA | 1987 | 251500.00 |
| ... | ... | ... | ... |
| 59 | TEX | 2014 | 4677294.13 |
| 60 | SEA | 2015 | 4888348.00 |
| 61 | TEX | 2015 | 4791426.30 |
| 62 | SEA | 2016 | 4845833.54 |
| 63 | TEX | 2016 | 6070300.79 |
64 rows × 3 columns
The chart shows that both teams started with lower salaries in the 1980s, but average pay steadily increased over time. While both teams experienced ups and downs, the Texas Rangers had more dramatic spikes and sharper increases compared to the Mariners. The Mariners’ salary growth was more consistent and gradual, whereas the Rangers ended up on top with more noticeable jumps throughout the years.
#the plot
from lets_plot.mapping import as_discrete
team_colors = {
'SEA': '#0C2C56', # Mariners - Navy
'TEX': '#C0111F' # Rangers - Red
}
ggplot(df_task3_mr, aes(x='yearID', y='avg_salary', color=as_discrete('teamID'))) + \
geom_line(size=1.5) + \
scale_color_manual(values=team_colors) + \
scale_x_continuous(breaks=list(range(df_task3_mr['yearID'].min(), df_task3_mr['yearID'].max()+1))) + \
ggtitle("Average Salary: Seattle Mariners vs Texas Rangers") + \
xlab("Year") + \
ylab("Average Salary (USD)") + \
theme_minimal()# Include and execute your code here
salary_by_position_query = """
WITH PrimaryPosition AS (
SELECT
playerID,
yearID,
POS,
MAX(G) AS games_played
FROM (
SELECT
playerID,
yearID,
POS,
SUM(G) AS G
FROM
Fielding
GROUP BY
playerID, yearID, POS
)
GROUP BY
playerID, yearID
HAVING
G = MAX(G)
),
PositionSalary AS (
SELECT
pp.POS AS position,
s.playerID,
s.salary
FROM
Salaries s
JOIN
PrimaryPosition pp ON s.playerID = pp.playerID AND s.yearID = pp.yearID
)
SELECT
position,
ROUND(AVG(salary), 2) AS average_salary,
COUNT(DISTINCT playerID) AS total_players,
MAX(salary) AS highest_salary,
CASE
WHEN AVG(salary) > 3000000 THEN 'High Salary'
WHEN AVG(salary) BETWEEN 2000000 AND 3000000 THEN 'Medium Salary'
ELSE 'Low Salary'
END AS salary_category
FROM
PositionSalary
GROUP BY
position
ORDER BY
average_salary DESC;
"""
df_salary_by_position = pd.read_sql_query(salary_by_position_query, con)
df_salary_by_position| position | average_salary | total_players | highest_salary | salary_category | |
|---|---|---|---|---|---|
| 0 | 1B | 3336210.69 | 469 | 28000000.0 | High Salary |
| 1 | OF | 2405474.54 | 1119 | 27328046.0 | Medium Salary |
| 2 | 3B | 2321342.06 | 491 | 33000000.0 | Medium Salary |
| 3 | SS | 2010018.49 | 368 | 22600000.0 | Medium Salary |
| 4 | P | 1938130.80 | 2556 | 33000000.0 | Low Salary |
| 5 | 2B | 1794492.33 | 485 | 24000000.0 | Low Salary |
| 6 | C | 1429781.30 | 404 | 23000000.0 | Low Salary |
career_query = """
WITH CareerSpan AS (
SELECT
a.playerID,
MIN(a.yearID) AS start_year,
MAX(a.yearID) AS end_year,
(MAX(a.yearID) - MIN(a.yearID) + 1) AS career_length,
SUM(a.G_all) AS total_games
FROM
Appearances a
GROUP BY
a.playerID
HAVING
total_games >= 10
),
TopCareers AS (
SELECT
cs.playerID,
p.nameFirst AS first_name,
p.nameLast AS last_name,
cs.career_length
FROM
CareerSpan cs
JOIN
People p ON cs.playerID = p.playerID
ORDER BY
cs.career_length DESC
LIMIT 10
)
SELECT
*
FROM
TopCareers;
"""
df_top_careers = pd.read_sql_query(career_query, con)
df_top_careers| playerID | first_name | last_name | career_length | |
|---|---|---|---|---|
| 0 | altroni01 | Nick | Altrock | 36 |
| 1 | orourji01 | Jim | O'Rourke | 33 |
| 2 | minosmi01 | Minnie | Minoso | 32 |
| 3 | olearch01 | Charley | O'Leary | 31 |
| 4 | lathaar01 | Arlie | Latham | 30 |
| 5 | mcguide01 | Deacon | McGuire | 29 |
| 6 | eversjo01 | Johnny | Evers | 28 |
| 7 | jennihu01 | Hughie | Jennings | 28 |
| 8 | ryanno01 | Nolan | Ryan | 28 |
| 9 | streega01 | Gabby | Street | 28 |
avg_career_query = """
WITH CareerSpan AS (
SELECT
playerID,
MIN(yearID) AS start_year,
MAX(yearID) AS end_year,
(MAX(yearID) - MIN(yearID) + 1) AS career_length,
SUM(G_all) AS total_games
FROM
Appearances
GROUP BY
playerID
HAVING
total_games >= 10
)
SELECT
ROUND(AVG(career_length), 2) AS avg_career_length
FROM
CareerSpan;
"""
df_avg_career = pd.read_sql_query(avg_career_query, con)
df_avg_career| avg_career_length | |
|---|---|
| 0 | 6.84 |